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Abstract 

We introduce a new representation learning approach for domain adaptation, in which 
data at training and test time come from similar but different distributions. Our approach 
is directly inspired by the theory on domain adaptation suggesting that, for effective do¬ 
main transfer to be achieved, predictions must be made based on features that cannot 
discriminate between the training (source) and test (target) domains. 

The approach implements this idea in the context of neural network architectures that 
are trained on labeled data from the source domain and unlabeled data from the target do¬ 
main (no labeled target-domain data is necessary). As the training progresses, the approach 
promotes the emergence of features that are (i) discriminative for the main learning task 
on the source domain and (ii) indiscriminate with respect to the shift between the domains. 
We show that this adaptation behaviour can be achieved in almost any feed-forward model 
by augmenting it with few standard layers and a new gradient reversal layer. The resulting 
augmented architecture can be trained using standard backpropagation and stochastic gra¬ 
dient descent, and can thus be implemented with little effort using any of the deep learning 
packages. 

We demonstrate the success of our approach for two distinct classification problems 
(document sentiment analysis and image classification), where state-of-the-art domain 
adaptation performance on standard benchmarks is achieved. We also validate the ap¬ 
proach for descriptor learning task in the context of person re-identification application. 
Keywords: domain adaptation, neural network, representation learning, deep learning, 
synthetic data, image classification, sentiment analysis, person re-identification 
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1. Introduction 

The cost of generating labeled data for a new machine learning task is often an obstacle 
for applying machine learning methods. In particular, this is a limiting factor for the fur¬ 
ther progress of deep neural network architectures, that have already brought impressive 
advances to the state-of-the-art across a wide variety of machine-learning tasks and appli¬ 
cations. For problems lacking labeled data, it may be still possible to obtain training sets 
that are big enough for training large-scale deep models, but that suffer from the shift in 
data distribution from the actual data encountered at “test time”. One important example 
is training an image classifier on synthetic or semi-synthetic images, which may come in 
abundance and be fully labeled, but which inevitably have a distribution that is different 
from real images (Liebelt and Schmid, 2010; Stark et ah, 2010; Vazquez et ah, 2014; Sun and 
Saenko, 2014). Another example is in the context of sentiment analysis in written reviews, 
where one might have labeled data for reviews of one type of product (e.^., movies), while 
having the need to classify reviews of other products (e.^., books). 

Learning a discriminative classifier or other predictor in the presence of a shift be¬ 
tween training and test distributions is known as domain adaptation (DA). The proposed 
approaches build mappings between the source (training-time) and the target (test-time) 
domains, so that the classifier learned for the source domain can also be applied to the 
target domain, when composed with the learned mapping between domains. The appeal 
of the domain adaptation approaches is the ability to learn a mapping between domains in 
the situation when the target domain data are either fully unlabeled {unsupervised domain 
annotation) or have few labeled samples {semi-supervised domain adaptation). Below, we 
focus on the harder unsupervised case, although the proposed approach {domain-adversarial 
learning) can be generalized to the semi-supervised case rather straightforwardly. 

Unlike many previous papers on domain adaptation that worked with fixed feature 
representations, we focus on combining domain adaptation and deep feature learning within 
one training process. Our goal is to embed domain adaptation into the process of learning 
representation, so that the final classification decisions are made based on features that 
are both discriminative and invariant to the change of domains, i.e., have the same or 
very similar distributions in the source and the target domains. In this way, the obtained 
feed-forward network can be applicable to the target domain without being hindered by 
the shift between the two domains. Our approach is motivated by the theory on domain 
adaptation (Ben-David et ah, 2006, 2010), that suggests that a good representation for 
cross-domain transfer is one for which an algorithm cannot learn to identify the domain of 
origin of the input observation. 

We thus focus on learning features that combine (i) discriminativeness and (ii) domain- 
invariance. This is achieved by jointly optimizing the underlying features as well as two 
discriminative classifiers operating on these features: (i) the label predictor that predicts 
class labels and is used both during training and at test time and (ii) the domain classifier 
that discriminates between the source and the target domains during training. While the 
parameters of the classifiers are optimized in order to minimize their error on the training set, 
the parameters of the underlying deep feature mapping are optimized in order to minimize 
the loss of the label classifier and to maximize the loss of the domain classifier. The latter 
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update thus works adversarially to the domain classifier, and it encourages domain-invariant 
features to emerge in the course of the optimization. 

Crucially, we show that all three training processes can be embedded into an appro¬ 
priately composed deep feed-forward network, called domain-adversarial neural network 
(DANN) (illustrated by Figure 1, page 12) that uses standard layers and loss functions, 
and can be trained using standard backpropagation algorithms based on stochastic gradi¬ 
ent descent or its modifications (e.^., SGD with momentum). The approach is generic as 
a DANN version can be created for almost any existing feed-forward architecture that is 
trainable by backpropagation. In practice, the only non-standard component of the pro¬ 
posed architecture is a rather trivial gradient reversal layer that leaves the input unchanged 
during forward propagation and reverses the gradient by multiplying it by a negative scalar 
during the backpropagation. 

We provide an experimental evaluation of the proposed domain-adversarial learning 
idea over a range of deep architectures and applications. We first consider the simplest 
DANN architecture where the three parts (label predictor, domain classifier and feature 
extractor) are linear, and demonstrate the success of domain-adversarial learning for such 
architecture. The evaluation is performed for synthetic data as well as for the sentiment 
analysis problem in natural language processing, where DANN improves the state-of-the-art 
marginalized Staeked Autoeneoders (mSDA) of Chen et al. (2012) on the common Amazon 
reviews benchmark. 

We further evaluate the approach extensively for an image classification task, and present 
results on traditional deep learning image data sets—such as MNIST (LeCun et ah, 1998) 
and SVHN (Netzer et ah, 2011)—as well as on Office benchmarks (Saenko et ah, 2010), 
where domain-adversarial learning allows obtaining a deep architecture that considerably 
improves over previous state-of-the-art accuracy. 

Finally, we evaluate domain-adversarial deseriptor learning in the context of person 
re-identification application (Gong et ah, 2014), where the task is to obtain good pedes¬ 
trian image descriptors that are suitable for retrieval and verification. We apply domain- 
adversarial learning, as we consider a deseriptor predietor trained with a Siamese-like loss 
instead of the label predictor trained with a classification loss. In a series of experiments, we 
demonstrate that domain-adversarial learning can improve cross-data-set re-identification 
considerably. 

2. Related work 

The general approach of achieving domain adaptation explored under many facets. Over the 
years, a large part of the literature has focused mainly on linear hypothesis (see for instance 
Blitzer et ah, 2006; Bruzzone and Marconcini, 2010; Germain et ah, 2013; Baktashmotlagh 
et ah, 2013; Cortes and Mohri, 2014). More recently, non-linear representations have become 
increasingly studied, including neural network representations (Glorot et ah, 2011; Li et ah, 
2014) and most notably the state-of-the-art mSDA (Chen et ah, 2012). That literature has 
mostly focused on exploiting the principle of robust representations, based on the denoising 
autoencoder paradigm (Vincent et ah, 2008). 

Concurrently, multiple methods of matching the feature distributions in the source and 
the target domains have been proposed for unsupervised domain adaptation. Some ap- 
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proaches perform this by reweighing or selecting samples from the source domain (Borg- 
wardt et ah, 2006; Huang et ah, 2006; Gong et ah, 2013), while others seek an explicit 
feature space transformation that would map source distribution into the target one (Pan 
et ah, 2011; Gopalan et ah, 2011; Baktashmotlagh et ah, 2013). An important aspect 
of the distribution matching approach is the way the (dis)similarity between distributions 
is measured. Here, one popular choice is matching the distribution means in the kernel- 
reproducing Hilbert space (Borgwardt et ah, 2006; Huang et ah, 2006), whereas Gong et al. 
(2012) and Fernando et al. (2013) map the principal axes associated with each of the dis¬ 
tributions. 

Our approach also attempts to match feature space distributions, however this is accom¬ 
plished by modifying the feature representation itself rather than by reweighing or geometric 
transformation. Also, our method uses a rather different way to measure the disparity be¬ 
tween distributions based on their separability by a deep discriminatively-trained classifier. 
Note also that several approaches perform transition from the source to the target domain 
(Gopalan et ah, 2011; Gong et ah, 2012) by changing gradually the training distribution. 
Among these methods, Chopra et al. (2013) does this in a “deep” way by the layerwise 
training of a sequence of deep autoencoders, while gradually replacing source-domain sam¬ 
ples with target-domain samples. This improves over a similar approach of Glorot et al. 
(2011) that simply trains a single deep autoencoder for both domains. In both approaches, 
the actual classifier/predictor is learned in a separate step using the feature representation 
learned by autoencoder(s). In contrast to Glorot et al. (2011); Chopra et al. (2013), our 
approach performs feature learning, domain adaptation and classifier learning jointly, in a 
unified architecture, and using a single learning algorithm (backpropagation). We therefore 
argue that our approach is simpler (both conceptually and in terms of its implementation). 
Our method also achieves considerably better results on the popular Office benchmark. 

While the above approaches perform unsupervised domain adaptation, there are ap¬ 
proaches that perform supervised domain adaptation by exploiting labeled data from the 
target domain. In the context of deep feed-forward architectures, such data can be used 
to “fine-tune” the network trained on the source domain (Zeiler and Fergus, 2013; Oquab 
et ah, 2014; Babenko et ah, 2014). Our approach does not require labeled target-domain 
data. At the same time, it can easily incorporate such data when they are available. 

An idea related to ours is described in Goodfellow et al. (2014). While their goal is 
quite different (building generative deep networks that can synthesize samples), the way 
they measure and minimize the discrepancy between the distribution of the training data 
and the distribution of the synthesized data is very similar to the way our architecture 
measures and minimizes the discrepancy between feature distributions for the two domains. 
Moreover, the authors mention the problem of saturating sigmoids which may arise at the 
early stages of training due to the significant dissimilarity of the domains. The technique 
they use to circumvent this issue (the “adversarial” part of the gradient is replaced by a 
gradient computed with respect to a suitable cost) is directly applicable to our method. 

Also, recent and concurrent reports by Tzeng et al. (2014); Long and Wang (2015) 
focus on domain adaptation in feed-forward networks. Their set of techniques measures and 
minimizes the distance between the data distribution means across domains (potentially, 
after embedding distributions into RKHS). Their approach is thus different from our idea 
of matching distributions by making them indistinguishable for a discriminative classifier. 
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Below, we compare our approach to Tzeng et al. (2014); Long and Wang (2015) on the 
Office benchmark. Another approach to deep domain adaptation, which is arguably more 
different from ours, has been developed in parallel by Chen et al. (2015). 

From a theoretical standpoint, our approach is directly derived from the seminal theo¬ 
retical works of Ben-David et al. (2006, 2010). Indeed, DANN directly optimizes the notion 
of ^/-divergence. We do note the work of Huang and Yates (2012), in which HMM repre¬ 
sentations are learned for word tagging using a posterior regularizer that is also inspired 
by Ben-David et al.’s work. In addition to the tasks being different—Huang and Yates 
(2012) focus on word tagging problems—, we would argue that DANN learning objective 
more closely optimizes the divergence, with Huang and Yates (2012) relying on cruder 
approximations for efficiency reasons. 

A part of this paper has been published as a conference paper (Ganin and Lempitsky, 
2015). This version extends Ganin and Lempitsky (2015) very considerably by incorporat¬ 
ing the report Ajakan et al. (2014) (presented as part of the Second Workshop on Transfer 
and Multi-Task Learning)^ which brings in new terminology, in-depth theoretical analy¬ 
sis and justification of the approach, extensive experiments with the shallow DANN case 
on synthetic data as well as on a natural language processing task (sentiment analysis). 
Furthermore, in this version we go beyond classification and evaluate domain-adversarial 
learning for descriptor learning setting within the person re-identification application. 

3. Domain Adaptation 

We consider classification tasks where X is the input space and Y = {0,1,..., L—1} is the 
set of L possible labels. Moreover, we have two different distributions over XxY, called the 
source domain Vs and the target domain Pt- An unsupervised domain adaptation learning 
algorithm is then provided with a labeled source sample S drawn i.i.d. from Ps? ^md an 
unlabeled target sample T drawn i.i.d. from V^^ where V^ is the marginal distribution of 
Dt over X. 

S = {(X,,!/.)}”., ~ (Vs )"; T = ~ (CJ)”', 

with N = n -\- n' being the total number of samples. The goal of the learning algorithm is 
to build a classifier g : X ^ Y with a low target risk 

Rvjiv) (??(x)^y), 

while having no information about the labels of V^- 

3.1 Domain Divergence 

To tackle the challenging domain adaptation task, many approaches bound the target error 
by the sum of the source error and a notion of distance between the source and the target 
distributions. These methods are intuitively justified by a simple assumption: the source 
risk is expected to be a good indicator of the target risk when both distributions are similar. 
Several notions of distance have been proposed for domain adaptation (Ben-David et ah, 
2006, 2010; Mansour et ah, 2009a b; Germain et ah, 2013). In this paper, we focus on the 
^/-divergence used by Ben-David et al. (2006, 2010), and based on the earlier work of Kifer 
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et al. (2004). Note that we assume in definition 1 below that the hypothesis class is a 
(discrete or continuous) set of binary classifiers 77 : X ^ { 0 , 1 }.^ 

Definition 1 (Ben-David et al., 2006, 2010; Kifer et al., 2004) Given two domain 
distributions and over X, and a hypothesis elass the 7 ^-divergence between 
Pg and is 


[v{^) = 1] - Pr^ [??(x) = 1] . 

That is, the 7^-divergence relies on the capacity of the hypothesis class H to distinguish 
between examples generated by Pg from examples generated by V^. Ben-David et al. 
(2006, 2010 ) proved that, for a symmetric hypothesis class one can compute the empirieal 
T-L-divergenee between two samples S ^ ^ ^ by computing 


dn{T>^,T>^) = 2 sup 

neH 


dH{S,T) 


2 1 — min 

\ rien 




i=l 


i=n+l 


( 1 ) 


where I[a] is the indicator function which is 1 if predicate a is true, and 0 otherwise. 


3.2 Proxy Distance 

Ben-David et al. (2006) suggested that, even if it is generally hard to compute d'u^S^T) 
exactly (e.^., when T-L is the space of linear classifiers on X), we can easily approximate 
it by running a learning algorithm on the problem of discriminating between source and 
target examples. To do so, we construct a new data set 

U = {(x.,0)}”.iU{(x,,l)}£„+i, (2) 

where the examples of the source sample are labeled 0 and the examples of the target sample 
are labeled 1. Then, the risk of the classifier trained on the new data set U approximates the 
“min” part of Equation (1). Given a generalization error e on the problem of discriminating 
between source and target examples, the ^/-divergence is then approximated by 

dA = 2(1-26). (3) 

In Ben-David et al. (2006), the value d^ is called the Proxy A-distanee (PAD). The A- 
distanee being defined as (/^(^S’^t) ~ ^ | (^) ~ (^) I’ where ^ is a 

subset of X. Note that, by choosing A = G ?/}, with the set represented by the 

characteristic function 77 , the A-distance and the ^/-divergence of Definition 1 are identical. 

In the experiments section of this paper, we compute the PAD value following the 
approach of Glorot et al. (2011); Chen et al. (2012), i.e., we train either a linear SVM or 
a deeper MLP classifier on a subset of U (Equation 2), and we use the obtained classifier 
error on the other subset as the value of e in Equation (3). More details and illustrations 
of the linear SVM case are provided in Section 5.1.5, 

1. As mentioned by Ben-David et al. (2006), the same analysis holds for multiclass setting. However, to 
obtain the same results when \Y\ > 2, one should assume that 7/ is a symmetrical hypothesis class. That 
is, for all h G 7/ and any permutation of labels c : Y ^ V, we have c(h) G 7/. Note that this is the case 
for most commonly used neural network architectures. 
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3.3 Generalization Bound on the Target Risk 

The work of Ben-David et al. (2006, 2010) also showed that the ?^-divergence dq-i{V^ 
is upper bounded by its empirical estimate dq-i{S^T) plus a constant complexity term that 
depends on the VC dimension of T-L and the size of samples S and T. By combining this 
result with a similar bound on the source risk, the following theorem is obtained. 

Theorem 2 (Ben-David et al., 2006) Let T-L he a hypothesis elass of VC dimension d. 
With probability 1 — 6 over the ehoiee of samples S ^ and T ^ , for every 

T] G Ti: 

RvM < Rsiv) + (rflog ^ + log I) + dn{S,T) +4^1 {dlog f + log |) + /?, 

with /3 > inf [i?pg(77*) + i?p^(77*)] , and 
r]*eV, 

^ m 

Rsiv) = ^Vi] 

n . 

1=1 

is the empirieal souree risk. 

The previous result tells us that Rx>rj.{r]) can be low only when the (3 term is low, i.e., only 
when there exists a classifier that can achieve a low risk on both distributions. It also tells 
us that, to find a classifier with a small Rx>rj.{r]) in a given class of fixed VC dimension, 
the learning algorithm should minimize (in that class) a trade-off between the source risk 
Rs{v) the empirical ^/-divergence d'u{S^ T). As pointed-out by Ben-David et al. (2006), 
a strategy to control the ^/-divergence is to find a representation of the examples where 
both the source and the target domain are as indistinguishable as possible. Under such a 
representation, a hypothesis with a low source risk will, according to Theorem 2, perform 
well on the target data. In this paper, we present an algorithm that directly exploits this 
idea. 

4. Domain-Adversarial Neural Networks (DANN) 

An original aspect of our approach is to explicitly implement the idea exhibited by Theo¬ 
rem 2 into a neural network classifier. That is, to learn a model that can generalize well 
from one domain to another, we ensure that the internal representation of the neural net¬ 
work contains no discriminative information about the origin of the input (source or target), 
while preserving a low risk on the source (labeled) examples. 

In this section, we detail the proposed approach for incorporating a “domain adaptation 
component” to neural networks. In Subsection 4.1, we start by developing the idea for the 
simplest possible case, i.e., a single hidden layer, fully connected neural network. We then 
describe how to generalize the approach to arbitrary (deep) network architectures. 

4.1 Example Case with a Shallow Neural Network 

Let us first consider a standard neural network (NN) architecture with a single hidden 
layer. For simplicity, we suppose that the input space is formed by m-dimensional real 
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vectors. Thus, X = The hidden layer Gf learns a function Gf : X ^ R^ that 

maps an example into a new ^^-dimensional representation^, and is parameterized by a 
matrix-vector pair (W,b) G R^><^ x R^ : 


with sigm(a) 


G/(x;W,b) = sigm(Wx + b), 


l+exp(-ai) 


( 4 ) 


Similarly, the prediction layer Gy learns a function Gy : R^ ^ [0,1]^ that is parame¬ 
terized by a pair (V,c) G R^^^ x R^: 

Gy(G/(x); V, c) = softmax(VG/(x)-h c) , 


with softmax(a) 


exp(ai) 


Here we have L = |y|. By using the softmax function, each component of vector 
Gy{Gf{'x.)) denotes the conditional probability that the neural network assigns x to the 
class in Y represented by that component. Given a source example (x^,^^), the natural 
classification loss to use is the negative log-probability of the correct label: 


Training the neural network then leads to the following optimization problem on the source 
domain: 


min 

W,b,V,c 


t V4(W,b,V,c) + A.i?(W,b) , 
n ^ ^ 


( 5 ) 


where 4(W,b,V,c)=/:j;(G'^(G/(x, ; W, b); V, c), is a shorthand notation for the pre¬ 
diction loss on the i-th example, and i?(W, b) is an optional regularizer that is weighted 
by hyper-parameter A. 


The heart of our approach is to design a domain regularizer directly derived from the 
H-divergence of Definition 1, To this end, we view the output of the hidden layer Gj(-) 
(Equation 4) as the internal representation of the neural network. Thus, we denote the 
source sample representations as 


S{Gf) = {G/(x)|xe4. 


Similarly, given an unlabeled sample from the target domain we denote the corresponding 
representations 

T{Gf) = {G/(x)|xeT}. 

Based on Equation (1), the empirical 7^-divergence of a symmetric hypothesis class % 
between samples S{Gf) and T{Gf) is given by 


dn{S{Gf),T{Gf)) 


= 2(1 — min 


1 ” 1 ^ 

i=l i=n+l 


■ ( 6 ) 


2. For brevity of notation, we will sometimes drop the dependence of G/ on its parameters (W, b) and 
shorten G/(x;W,b) to G/(x). 
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Let us consider % as the class of hyperplanes in the representation space. Inspired by the 
Proxy ^-distance (see Section 3.2), we suggest estimating the “min” part of Equation (6) 
by a domain classification layer Gd that learns a logistic regressor Gd : ^ [O?!]? 

parameterized by a vector-scalar pair (u, z) G x R, that models the probability that a 
given input is from the source domain Pg or the target domain Vf . Thus, 

Gd(G'/(x); u, z) = sigm(u’^G/(x) + z) . (7) 

Hence, the function Gd{-) is a domain regressor. We define its loss by 

£,(0,(0/(x,)),<!.) = l-0.(0Ax.)) ' 

where di denotes the binary variable [domain label) for the i-th example, which indicates 
whether come from the source distribution (x^^Dg if di—0) or from the target distribu¬ 
tion if di=l). 

Recall that for the examples from the source distribution (d^=0), the corresponding 
labels yi ^ Y are known at training time. For the examples from the target domains, we 
do not know the labels at training time, and we want to predict such labels at test time. 
This enables us to add a domain adaptation term to the objective of Equation (5), giving 
the following regularizer: 


i?(W,b) 


max 

u,z 


i E 4(W,b,u,z) 

' ^ ' -1 ' ^ ' -1 

1=1 i=n+l 


( 8 ) 


where b, u, z)=£d{GdiGf{xi; W, b); u, z)^di). This regularizer seeks to approximate 

the ^/-divergence of Equation (6), as 2(1 — i?(W, b)) is a surrogate for d'u[S{Gf)^T{Gf)). In 
line with Theorem 2, the optimization problem given by Equations (5) and (8) implements a 
trade-off between the minimization of the source risk Rs{’) and the divergence (i^(-, •). The 
hyper-parameter A is then used to tune the trade-off between these two quantities during 
the learning process. 

For learning, we first note that we can rewrite the complete optimization objective of 
Equation (5) as follows: 


E;(W,V,b,c,u,z) 

-t n 1 ^ 

= -E4(W,b,V,c)-A(-5;£i{W,b,u,2) + 


( 9 ) 

-^£MW,b,u,*-)), 

i=n+l 


where we are seeking the parameters W, V, b, c, u, z that deliver a saddle point given by 
(W,V,b,c) = argmin E(W, V, b, c, u, z ), 

W,V,b,c 

(u, z) = argmax E(W, V, b, c, u, z) . 

u,z 

Thus, the optimization problem involves a minimization with respect to some parameters, 
as well as a maximization with respect to the others. 
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Algorithm 1 Shallow DANN - Stochastic training update 


1: Input: 

— samples S = {(xi,yd}?=i and T = {xi}^=i, 

— hidden layer size U, 

— adaptation parameter A, 

— learning rate /x, 

2: Output: neural network {W, V, b, c} 

3: W, V ^ random_init( U) 

4: b, c, u, d ^ 0 

5: while stopping criterion is not met do 
6: for i from 1 to n do 

7: # Forward propagation 

8: ^ sigm(b + Wx^) 

9: Gy(G/(xd) ^ softmax(c + VG/(xd) 

10: ^ Backpropagation 

11: Ac <-(e(i/i) - Ga(G/(xi))) 

12: Av Ac G/(xi)^ 

13: Ab^ (V^Ac)©G/(xO©(l-G/(xi)) 

14: Aw <— Ab • (xj)^ 

15: ^ Domain adaptation regularizer... 

16: # ••-from current domain 

17: Gd(G/(xi)) sigm(d + u^G/(xi)) 

18: Ad^A(l-Gd(G/(xi))) 

19: Au^A(l-Gd(G/(xO))G/(xO 


20: Imp <- A(1 - Gd(G/(xi))) 

XU© G/(xi) © (1 - G/(xi)) 
21: Ab ^ Ab + tmp 

22: Aw Aw + tmp • (x^)"*" 

23: # ...from other domain 

24: jA— uniform_integer(l,..., n') 

25: G/(xj) ^ sigm(b + Wxj) 

26: Gd(G/(xj)) ^ sigm(d + u’^G/(xj)) 

27: Ad ^ Ad-AGd(G/(x, )) 

28: Au i — Au — AG(i(G/(xj))G/(xj) 

29: tmp ^- AGd(G/(xj)) 

X u 0 Gf(yij) O (1 - Gf(yij)) 
30: Ab ^ Ab + tmp 

31: Aw ^ Aw + tmp • (xj)^ 

32: ^ Update neural network parameters 

33: W ^ W - /xAw 

34: V ^ V - /xAv 

35: b ^ b — /xAb 

36: c ^ c — /xAc 

37: ^ Update domain classifier 

38: u ^ u + /xAu 

39: d i — d fi^d 

40: end for 

41: end while 


Note: In this pseudo-code, e(y) refers to a “one-hot” vector, consisting of all Os except for a 1 at position y, 
and O is the element-wise product. 


We propose to tackle this problem with a simple stochastic gradient procedure, in which 
updates are made in the opposite direction of the gradient of Equation (9) for the minimizing 
parameters, and in the direction of the gradient for the maximizing parameters. Stochastic 
estimates of the gradient are made, using a subset of the training samples to compute the 
averages. Algorithm 1 provides the complete pseudo-code of this learning procedure.^ In 
words, during training, the neural network (parameterized by W,b, V,c) and the domain 
regressor (parameterized by u, z) are competing against each other, in an adversarial way, 
over the objective of Equation (9). Eor this reason, we refer to networks trained according 
to this objective as Domain-Adversarial Neural Networks (DANN). DANN will effectively 
attempt to learn a hidden layer Gj(-) that maps an example (either source or target) into 
a representation allowing the output layer Gy(-) to accurately classify source samples, but 
crippling the ability of the domain regressor G(i{-) to detect whether each example belongs 
to the source or target domains. 


3. We provide an implementation of Shallow DANN algorithm at http://graal.ift.ulaval.ca/danii/ 
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4.2 Generalization to Arbitrary Architectures 

For illustration purposes, we’ve so far focused on the case of a single hidden layer DANN. 
However, it is straightforward to generalize to other sophisticated architectures, which might 
be more appropriate for the data at hand. For example, deep convolutional neural networks 
are well known for being state-of-the-art models for learning discriminative features of 
images (Krizhevsky et ah, 2012). 

Let us now use a more general notation for the different components of DANN. Namely, 
let Gf{’]9f) be the L)-dimensional neural network feature extractor, with parameters Of. 
Also, let Gy{’]9y) be the part of DANN that computes the network’s label prediction out¬ 
put layer, with parameters while Gd{-]9d) now corresponds to the computation of the 
domain prediction output of the network, with parameters 9d> Note that for preserving 
the theoretical guarantees of Theorem 2, the hypothesis class T-Ld generated by the domain 
prediction component Gd should include the hypothesis class T-Ly generated by the label 
prediction component Gy. Thus, 1-Ly C l-Ld- 

We will note the prediction loss and the domain loss respectively by 

Cy{9f^9y) = Cy(jGy{Gf{'Ki] 9f)] 9y)^yi^ ^ 

Q{ef,ed) = Cd{Gd{Gf{Ki-ef)-ed),di). 

Training DANN then parallels the single layer case and consists in optimizing 

-i 1 ^ 1 ^ 

£(»/, + «.<))• 

i=l i=l i=n+l 

by finding the saddle point 9f^9y^9d such that 

{Of, By) = argmin E{6f,6y,0d) 

Of.Oy 

9d = argmax E{9f, 9y, 9d) 

Od 

As suggested previously, a saddle point defined by Equations (11 12) can be found as a 
stationary point of the following gradient updates: 


(11) 

( 12 ) 


9f ft 






(13) 

(14) 

(15) 


where /i is the learning rate. We use stochastic estimates of these gradients, by sampling 
examples from the data set. 

The updates of Equations (13 15) are very similar to stochastic gradient descent (SGD) 
updates for a feed-forward deep model that comprises feature extractor fed into the label 
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Figure 1: The proposed architecture includes a deep feature extraetor (green) and a deep 
label predietor (blue), which together form a standard feed-forward architecture. 
Unsupervised domain adaptation is achieved by adding a domain elassifier (red) 
connected to the feature extractor via a gradient reversal layer that multiplies 
the gradient by a certain negative constant during the backpropagation-based 
training. Otherwise, the training proceeds standardly and minimizes the label 
prediction loss (for source examples) and the domain classification loss (for all 
samples). Gradient reversal ensures that the feature distributions over the two 
domains are made similar (as indistinguishable as possible for the domain classi¬ 
fier), thus resulting in the domain-invariant features. 


predictor and into the domain classifier (with loss weighted by A). The only difference is 
that in (13), the gradients from the class and domain predictors are subtracted, instead of 
being summed (the difference is important, as otherwise SGD would try to make features 
dissimilar across domains in order to minimize the domain classification loss). Since SGD— 
and its many variants, such as ADAGRAD (Duchi et ah, 2010) or ADADELTA (Zeiler, 
2012)—is the main learning algorithm implemented in most libraries for deep learning, it 
would be convenient to frame an implementation of our stochastic saddle point procedure 
as SGD. 

Fortunately, such a reduction can be accomplished by introducing a special gradient 
reversal layer (GRL), defined as follows. The gradient reversal layer has no parameters 
associated with it. During the forward propagation, the GRL acts as an identity trans¬ 
formation. During the backpropagation however, the GRL takes the gradient from the 
subsequent level and changes its sign, i.e., multiplies it by —1, before passing it to the 
preceding layer. Implementing such a layer using existing object-oriented packages for deep 
learning is simple, requiring only to define procedures for the forward propagation (identity 
transformation), and backpropagation (multiplying by —1). The layer requires no parame¬ 
ter update. 

The GRL as defined above is inserted between the feature extractor Gf and the domain 
classifier resulting in the architecture depicted in Figure 1, As the backpropagation 
process passes through the GRL, the partial derivatives of the loss that is downstream 


12 










































Domain-Adversarial Neural Networks 


the GRL (i.e., Cd) w.r.t. the layer parameters that are upstream the GRL (i.e., Of) get 
multiplied by —1, i.e., is effectively replaced with — Therefore, running SGD in 
the resulting model implements the updates of Equations (13 15) and converges to a saddle 
point of Equation (10). 

Mathematically, we can formally treat the gradient reversal layer as a “pseudo-function” 
7^(x) defined by two (incompatible) equations describing its forward and backpropagation 
behaviour: 


7^(x) = X, 

(16) 

dn 

dx ’ 

(17) 


where I is an identity matrix. We can then define the objective “pseudo-function” of 
(Of^Oy^Od) that is being optimized by the stochastic gradient descent within our method: 

1 "" 

EiOf, By, 0d)=-J2 {GyiGfi^r, 9fy,ey),y^) (18) 

i=l 

^ n 1 ^ 

- A (- ^ {GdiniGfi^r,ef))-6^), di) + ^J2 (G'd(7^(G/(x,; 0 /)); 0^), ) • 

i=l i=n-\-l 

Running updates (13 15) can then be implemented as doing SGD for (18) and leads 
to the emergence of features that are domain-invariant and discriminative at the same 
time. After the learning, the label predictor Gy(G/(x; Of); Oy) can be used to predict labels 
for samples from the target domain (as well as from the source domain). Note that we 
release the source code for the Gradient Reversal layer along with the usage examples as 
an extension to Caffe (Jia et ah, 2014).^ 

5. Experiments 

In this section, we present a variety of empirical results for both shallow domain adversarial 
neural networks (Subsection 5.1) and deep ones (Subsections 5.2 and 5.3). 

5.1 Experiments with Shallow Neural Networks 

In this first experiment section, we evaluate the behavior of the simple version of DANN 
described by Subsection 4.1, Note that the results reported in the present subsection are 
obtained using Algorithm 1 , Thus, the stochastic gradient descent approach here consists of 
sampling a pair of source and target examples and performing a gradient step update of all 
parameters of DANN. Crucially, while the update of the regular parameters follows as usual 
the opposite direction of the gradient, for the adversarial parameters the step must follow 
the gradient’s direction (since we maximize with respect to them, instead of minimizing). 

5.1.1 Experiments on a Toy Problem 

As a first experiment, we study the behavior of the proposed algorithm on a variant of the 
inter-twinning moons 2D problem, where the target distribution is a rotation of the source 

4. http://sites.skoltech.ru/compvision/projects/grl/ 
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Label classification Representation PCA Domain classification Hidden neurons 
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(a) Standard NN. For the “domain classification”, we use a non adversarial domain regressor on the hidden 
neurons learned by the Standard NN. (This is equivalent to run Algorithm 1, without Lines 22 and 31) 


+■*■ -V 




(b) DANN (Algorithm 1) 

Figure 2: The inter-twinning moons toy problem. Examples from the source sample are rep¬ 
resented as a (label 1) and a “—’’(label 0), while examples from the unlabeled 
target sample are represented as black dots. See text for the figure discussion. 


one. As the source sample 5, we generate a lower moon and an upper moon labeled 0 and 1 
respectively, each of which containing 150 examples. The target sample T is obtained by 
the following procedure: (1) we generate a sample S' the same way S has been generated; 
(2) we rotate each example by 35°; and (3) we remove all the labels. Thus, T contains 300 
unlabeled examples. We have represented those examples in Figure 2, 

We study the adaptation capability of DANN by comparing it to the standard neural net¬ 
work (NN). In these toy experiments, both algorithms share the same network architecture, 
with a hidden layer size of 15 neurons. We train the NN using the same procedure as the 
DANN. That is, we keep updating the domain regressor component using target sample T 
(with a hyper-parameter A = 6; the same value is used for DANN), but we disable the adver¬ 
sarial back-propagation into the hidden layer. To do so, we execute Algorithm 1 by omitting 
the lines numbered 22 and 31, This allows recovering the NN learning algorithm—based 
on the source risk minimization of Equation (5) without any regularizer—and simultane¬ 
ously train the domain regressor of Equation (7) to discriminate between source and target 
domains. With this toy experience, we will first illustrate how DANN adapts its decision 
boundary when compared to NN. Moreover, we will also illustrate how the representation 
given by the hidden layer is less adapted to the source domain task with DANN than with 
NN (this is why we need a domain regressor in the NN experiment). We recall that this is 
the founding idea behind our proposed algorithm. The analysis of the experiment appears 
in Figure 2, where upper graphs relate to standard NN, and lower graphs relate to DANN. 
By looking at the lower and upper graphs pairwise, we compare NN and DANN from four 
different perspectives, described in details below. 
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The column “Label Classification” of Figure 2 shows the decision boundaries of 
DANN and NN on the problem of predicting the labels of both source and the target 
examples. As expected, NN accurately classifies the two classes of the source sample 
but is not fully adapted to the target sample T. On the contrary, the decision boundary of 
DANN perfectly classifies examples from both source and target samples. In the studied 
task, DANN clearly adapts to the target distribution. 

The column “Representation PCA” studies how the domain adaptation regularizer 
affects the representation G/(-) provided by the network hidden layer. The graphs are ob¬ 
tained by applying a Principal component analysis (PCA) on the set of all representation of 
source and target data points, i.e., S{Gf)VJT{Gf). Thus, given the trained network (NN or 
DANN), every point from S and T is mapped into a 15-dimensional feature space through 
the hidden layer, and projected back into a two-dimensional plane by the PCA transforma¬ 
tion. In the DANN-PCA representation, we observe that target points are homogeneously 
spread out among source points; In the NN-PCA representation, a number of target points 
belong to clusters containing no source points. Hence, labeling the target points seems an 
easier task given the DANN-PCA representation. 

To push the analysis further, the PCA graphs tag four crucial data points by the letters 
A, B, C and D, that correspond to the moon extremities in the original space (note that 
the original point locations are tagged in the first column graphs). We observe that points 
A and B are very close to each other in the NN-PCA representation, while they clearly 
belong to different classes. The same happens to points C and D. Conversely, these four 
points are at the opposite four corners in the DANN-PCA representation. Note also that 
the target point A (resp. D)—that is difficult to classify in the original space—is located 
in the “+” cluster (resp. “—’’cluster) in the DANN-PCA representation. Therefore, the 
representation promoted by DANN is better suited to the adaptation problem. 

The column “Domain Classification” shows the decision boundary on the domain 
classification problem, which is given by the domain regressor Ga of Equation (7). More 
precisely, an example x is classified as a source example when Gd{Gf{'x.)) > 0.5, and is 
classified as a domain example otherwise. Remember that, during the learning process 
of DANN, the Gd regressor struggles to discriminate between source and target domains, 
while the hidden representation G/(*) is adversarially updated to prevent it to succeed. 
As explained above, we trained a domain regressor during the learning process of NN, but 
without allowing it to influence the learned representation Gf{-). 

On one hand, the DANN domain regressor clearly fails to generalize source and target dis¬ 
tribution topologies. On the other hand, the NN domain regressor shows a better (although 
imperfect) generalization capability. Inter alia, it seems to roughly capture the rotation an¬ 
gle of the target distribution. This again corroborates that the DANN representation does 
not allow discriminating between domains. 

The column “Hidden Neurons” shows the configuration of hidden layer neurons (by 
Equation 4, we have that each neuron is indeed a linear regressor). In other words, each 
of the fifteen plot line corresponds to the coordinates x E for which the i-th component 
of G/(x) equals for i E {!,..., 15}. We observe that the standard NN neurons are 
grouped in three clusters, each one allowing to generate a straight line of the zigzag decision 
boundary for the label classification problem. However, most of these neurons are also able 
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to (roughly) capture the rotation angle of the domain classification problem. Hence, we 
observe that the adaptation regularizer of DANN prevents these kinds of neurons to be 
produced. It is indeed striking to see that the two predominant patterns in the NN neurons 
(i.e., the two parallel lines crossing the plane from lower left to upper right) are vanishing 
in the DANN neurons. 

5.1.2 Unsupervised Hyper-Parameter Selection 

To perform unsupervised domain adaption, one should provide ways to set hyper-parameters 
(such as the domain regularization parameter A, the learning rate, the network architecture 
for our method) in an unsupervised way, i.e., without referring to labeled data in the 
target domain. In the following experiments of Sections 5.1.3 and 5.1.4, we select the 
hyper-parameters of each algorithm by using a variant of reverse eross-validation approach 
proposed by Zhong et al. (2010), that we call reverse validation. 

To evaluate the reverse validation risk associated to a tuple of hyper-parameters, we 
proceed as follows. Given the labeled source sample S and the unlabeled target sample T, 
we split each set into training sets {S' and T' respectively, containing 90% of the original 
examples) and the validation sets {Sy and Ty respectively). We use the labeled set S' 
and the unlabeled target set T' to learn a classifier 77 . Then, using the same algorithm, 
we learn a reverse classifier r]r using the self-labeled set {(x, 77(x))}xGr' and the unlabeled 
part of S' as target sample. Finally, the reverse classifier rjr is evaluated on the validation 
set Sy of source sample. We then say that the classifier 77 has a reverse validation risk of 
RsviVr)- The process is repeated with multiple values of hyper-parameters and the selected 
parameters are those corresponding to the classifier with the lowest reverse validation risk. 

Note that when we train neural network architectures, the validation set Sy is also 
used as an early stopping criterion during the learning of 77 , and self-labeled validation set 
{(x, 77(x))}xGrv is used as an early stopping criterion during the learning of rjr. We also 
observed better accuracies when we initialized the learning of the reverse classifier 77 ^ with 
the configuration learned by the network 77 . 

5.1.3 Experiments on Sentiment Analysis Data Sets 

We now compare the performance of our proposed DANN algorithm to a standard neural 
network with one hidden layer (NN) described by Equation (5), and a Support Vector 
Machine (SVM) with a linear kernel. We compare the algorithms on the Amazon reviews 
data set, as pre-processed by Chen et al. ( 2012 ). This data set includes four domains, each 
one composed of reviews of a specific kind of product (books, dvd disks, electronics, and 
kitchen appliances). Reviews are encoded in 5 000 dimensional feature vectors of unigrams 
and bigrams, and labels are binary: “0” if the product is ranked up to 3 stars, and “1” if 
the product is ranked 4 or 5 stars. 

We perform twelve domain adaptation tasks. All learning algorithms are given 2 000 
labeled source examples and 2 000 unlabeled target examples. Then, we evaluate them on 
separate target test sets (between 3000 and 6000 examples). Note that NN and SVM do 
not use the unlabeled target sample for learning. 

Here are more details about the procedure used for each learning algorithms leading to 
the empirical results of Table 1, 
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Original data 

mSDA representation 

Source 

Target 

DANN 

NN 

SVM 

DANN 

NN 

SVM 


BOOKS 

DVD 

.784 

.790 

.799 

.829 

.824 

.830 

BOOKS 

ELECTRONICS 

.733 

.747 

.748 

.804 

.770 

.766 

BOOKS 

KITCHEN 

.779 

.778 

.769 

.843 

.842 

.821 

DVD 

BOOKS 

.723 

.720 

.743 

.825 

.823 

.826 

DVD 

ELECTRONICS 

.754 

.732 

.748 

.809 

.768 

.739 

DVD 

KITCHEN 

.783 

.778 

.746 

.849 

.853 

.842 

ELECTRONICS 

BOOKS 

.713 

.709 

.705 

.774 

.770 

.762 

ELECTRONICS 

DVD 

.738 

.733 

.726 

.781 

.759 

.770 

ELECTRONICS 

KITCHEN 

.854 

.854 

.847 

.881 

.863 

.847 

KITCHEN 

BOOKS 

.709 

.708 

.707 

.718 

.721 

.769 

KITCHEN 

DVD 

.740 

.739 

.736 

.789 

.789 

.788 

KITCHEN 

ELECTRONICS 

.843 

.841 

.842 

.856 

.850 

.861 


(a) Classification 

accuracy on 

the Amazon reviews data set 



Original data 



mSDA representations 


DANN NN 

SVM 



DANN 

NN SVM 

DANN 

.50 .87 

.83 

DANN 

.50 

.92 

.88 

NN 

.13 .50 

.63 

NN 

.08 

.50 

.62 

SVM 

.17 .37 

.50 

SVM 

.12 

.38 

.50 


(b) Pairwise Poisson binomial test 


Table 1: Classification accuracy on the Amazon reviews data set, and Pairwise Poisson 
binomial test. 


• For the DANN algorithm, the adaptation parameter A is chosen among 9 values 
between 10“^ and 1 on a logarithmic scale. The hidden layer size I is either 50 or 100. 
Finally, the learning rate fi is fixed at 10“^. 

• For the NN algorithm, we use exactly the same hyper-parameters grid and training 
procedure as DANN above, except that we do not need an adaptation parameter. 
Note that one can train NN by using the DANN implementation (Algorithm 1) with 
A = 0. 

• For the SVM algorithm, the hyper-parameter C is chosen among 10 values between 
10“^ and 1 on a logarithmic scale. This range of values is the same as used by Chen 
et al. (2012) in their experiments. 

As presented at Section 5.1.2, we used reverse eross validation selecting the hyper-parameters 
for all three learning algorithms, with early stopping as the stopping criterion for DANN 
and NN. 
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The “Original data” part of Table la shows the target test accuracy of all algorithms, 
and Table lb reports the probability that one algorithm is significantly better than the oth¬ 
ers according to the Poisson binomial test (Lacoste et ah, 2012). We note that DANN has 
a significantly better performance than NN and SVM, with respective probabilities 0.87 
and 0.83. As the only difference between DANN and NN is the domain adaptation regu- 
larizer, we conclude that our approach successfully helps to find a representation suitable 
for the target domain. 

5.1.4 Combining DANN with Denoising Autoengoders 

We now investigate on whether the DANN algorithm can improve on the representation 
learned by the state-of-the-art Marginalized Stacked Denoising Autoencoders (mSDA) pro¬ 
posed by Chen et al. (2012). In brief, mSDA is an unsupervised algorithm that learns a 
new robust feature representation of the training samples. It takes the unlabeled parts of 
both source and target samples to learn a feature map from input space X to a new rep¬ 
resentation space. As a denoising autoencoders algorithm, it finds a feature representation 
from which one can (approximately) reconstruct the original features of an example from its 
noisy counterpart. Chen et al. (2012) showed that using mSDA with a linear SVM classifier 
reaches state-of-the-art performance on the Amazon reviews data sets. As an alternative 
to the SVM, we propose to apply our Shallow DANN algorithm on the same representa¬ 
tions generated by mSDA (using representations of both source and target samples). Note 
that, even if mSDA and DANN are two representation learning approaches, they optimize 
different objectives, which can be complementary. 

We perform this experiment on the same Amazon reviews data set described in the 
previous subsection. For each source-target domain pair, we generate the mSDA represen¬ 
tations using a corruption probability of 50% and a number of layers of 5. We then execute 
the three learning algorithms (DANN, NN, and SVM) on these representations. More pre¬ 
cisely, following the experimental procedure of Chen et al. (2012), we use the concatenation 
of the output of the 5 layers and the original input as the new representation. Thus, each 
example is now encoded in a vector of 30 000 dimensions. Note that we use the same grid 
search as in the previous Subsection 5.1.3, but use a learning rate ji of 10“^ for both DANN 
and the NN. The results of “mSDA representation” columns in Table la confirm that com¬ 
bining mSDA and DANN is a sound approach. Indeed, the Poisson binomial test shows 
that DANN has a better performance than the NN and the SVM, with probabilities 0.92 
and 0.88 respectively, as reported in Table lb, We note however that the standard NN 
and the SVM find the best solution on respectively the second and the fourth tasks. This 
suggests that DANN and mSDA adaptation strategies are not fully complementary. 

5.1.5 Proxy Distange 

The theoretical foundation of the DANN algorithm is the domain adaptation theory of Ben- 
David et al. (2006, 2010). We claimed that DANN finds a representation in which the source 
and the target example are hardly distinguishable. Our toy experiment of Section 5.1.1 
already points out some evidence for that and here we provide analysis on real data. To 
do so, we compare the Proxy M-distance (PAD) on various representations of the Amazon 
Reviews data set; these representations are obtained by running either NN, DANN, mSDA, 
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tions. 


Figure 3: Proxy ^-distances (PAD). Note that the PAD values of mSDA representations 
are symmetric when swapping source and target samples. 


or mSDA and DANN combined. Recall that PAD, as described in Section 3.2, is a metric 
estimating the similarity of the source and the target representations. More precisely, to 
obtain a PAD value, we use the following procedure: (1) we construct the data set U of 
Equation (2) using both source and target representations of the training samples; (2) we 
randomly split U in two subsets of equal size; (3) we train linear SVMs on the first subset 
of U using a large range of C values; (4) we compute the error of all obtained classifiers 
on the second subset of U] and (5) we use the lowest error to compute the PAD value of 
Equation (3). 

Firstly, Figure 3a compares the PAD of DANN representations obtained in the experi¬ 
ments of Section 5.1.3 (using the hyper-parameters values leading to the results of Table 1) 
to the PAD computed on raw data. As expected, the PAD values are driven down by the 
DANN representations. 

Secondly, Figure 3b compares the PAD of DANN representations to the PAD of standard 
NN representations. As the PAD is influenced by the hidden layer size (the discriminating 
power tends to increase with the representation length), we fix here the size to 100 neurons 
for both algorithms. We also fix the adaptation parameter of DANN to A ~ 0.31; it was 
the value that has been selected most of the time during our preceding experiments on the 
Amazon Reviews data set. Again, DANN is clearly leading to the lowest PAD values. 

Lastly, Figure 3c presents two sets of results related to Section 5.1.4 experiments. On 
one hand, we reproduce the results of Chen et al. (2012), which noticed that the mSDA 
representations have greater PAD values than original (raw) data. Although the mSDA 
approach clearly helps to adapt to the target task, it seems to contradict the theory of Ben- 
David et ah, On the other hand, we observe that, when running DANN on top of mSDA 
(using the hyper-parameters values leading to the results of Table 1), the obtained represen¬ 
tations have much lower PAD values. These observations might explain the improvements 
provided by DANN when combined with the mSDA procedure. 
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5.2 Experiments with Deep Networks on Image Classification 

We now perform extensive evaluation of a deep version of DANN (see Subsection 4.2) on a 
number of popular image data sets and their modifications. These include large-scale data 
sets of small images popular with deep learning methods, and the Office data sets (Saenko 
et ah, 2010), which are a de facto standard for domain adaptation in computer vision, but 
have much fewer images. 

5.2.1 Baselines 

The following baselines are evaluated in the experiments of this subsection. The source-only 
model is trained without consideration for target-domain data (no domain classifier branch 
included into the network). The train-on-target model is trained on the target domain with 
class labels revealed. This model serves as an upper bound on DA methods, assuming that 
target data are abundant and the shift between the domains is considerable. 

In addition, we compare our approach against the recently proposed unsupervised DA 
method based on subspace alignment (SA) (Fernando et ah, 2013), which is simple to setup 
and test on new data sets, but has also been shown to perform very well in experimental 
comparisons with other “shallow” DA methods. To boost the performance of this baseline, 
we pick its most important free parameter (the number of principal components) from the 
range {2,..., 60}, so that the test performance on the target domain is maximized. To apply 
SA in our setting, we train a source-only model and then consider the activations of the last 
hidden layer in the label predictor (before the final linear classifier) as descriptors/features, 
and learn the mapping between the source and the target domains (Fernando et ah, 2013). 

Since the SA baseline requires training a new classifier after adapting the features, and 
in order to put all the compared settings on an equal footing, we retrain the last layer of 
the label predictor using a standard linear SVM (Fan et ah, 2008) for all four considered 
methods (including ours; the performance on the target domain remains approximately the 
same after the retraining). 

For the Office data set (Saenko et ah, 2010), we directly compare the performance of 
our full network (feature extractor and label predictor) against recent DA approaches using 
previously published results. 

5.2.2 CNN architectures and Training Procedure 

In general, we compose feature extractor from two or three convolutional layers, picking their 
exact configurations from previous works. More precisely, four different architectures were 
used in our experiments. The first three are shown in Figure 4, For the Office domains, 
we use pre-trained AlexNet from the Caffe-package (Jia et ah, 2014). The adaptation 
architecture is identical to Tzeng et al. (2014).^ 

For the domain adaption component, we use three (t^ 1024^1024^2) fully connected 
layers, except for MNIST where we used a simpler (t^ 100^2) architecture to speed up 
the experiments. Admittedly these choices for domain classifier are arbitrary, and better 
adaptation performance might be attained if this part of the architecture is tuned. 


5. A 2-layer domain classifier (x^l024^1024^2) is attached to the 256-dimensional bottleneck of fc7. 
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(a) MNIST architecture; inspired by the classical LeNet-5 (LeCun et ah, 1998). 



(b) SVHN architecture; adopted from Srivastava et ah (2014). 



(c) GTSRB architecture; we used the single-CNN baseline from Ciresan et al. (2012) as our starting 
point. 

Figure 4: CNN architectures used in the experiments. Boxes correspond to transformations 
applied to the data. Color-coding is the same as in Figure 1, 


For the loss functions, we set Cy and Cd to be the logistic regression loss and the 
binomial cross-entropy respectively. Following Srivastava et al. (2014) we also use dropout 
and ^ 2 -norm restriction when we train the SVHN architecture. 

The other hyper-parameters are not selected through a grid search as in the small scale 
experiments of Section 5.1, which would be computationally costly. Instead, the learning 
rate is adjusted during the stochastic gradient descent using the following formula: 

_ /^o 

{l-\- a -pY ’ 

where p is the training progress linearly changing from 0 to 1 , po = 0 . 01 , a = 10 and 
13 — 0.75 (the schedule was optimized to promote convergence and low error on the source 
domain). A momentum term of 0.9 is also used. 

The domain adaptation parameter A is initiated at 0 and is gradually changed to 1 using 
the following schedule: 

2 

^ l + exp(- 7 -p) 

where 7 was set to 10 in all experiments (the schedule was not optimized/tweaked). This 
strategy allows the domain classifier to be less sensitive to noisy signal at the early stages of 
the training procedure. Note however that these \p were used only for updating the feature 
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MNIST —> MNIST-M: top feature extractor layer 





(a) Non-adapted 



(b) Adapted 


Syn Numbers ^ SVHN: last hidden layer of the label predictor 




(b) Adapted 


Figure 5: The effect of adaptation on the distribution of the extracted features (best viewed 
in color). The figure shows t-SNE (van der Maaten, 2013) visualizations of the 
CNN’s activations (a) in case when no adaptation was performed and (b) in 
case when our adaptation procedure was incorporated into training. Blue points 
correspond to the source domain examples, while red ones correspond to the target 
domain. In all cases, the adaptation in our method makes the two distributions 
of features much closer. 


extraetor component Gf. For updating the domain elassifieation component, we used a 
fixed A = 1, to ensure that the latter trains as fast as the label predietor Gy.^ 

Finally, note that the model is trained on 128-sized batches (images are preprocessed by 
the mean subtraction). A half of each batch is populated by the samples from the source 
domain (with known labels), the rest constitutes the target domain (with labels not revealed 
to the algorithms except for the train-on-target baseline). 

5.2.3 Visualizations 

We use t-SNE (van der Maaten, 2013) projection to visualize feature distributions at dif¬ 
ferent points of the network, while color-coding the domains (Figure 5). As we already 
observed with the shallow version of DANN (see Figure 2), there is a strong correspondence 

6. Equivalently, one can use the same Xp for both feature extractor and domain classification components, 
but use a learning rate of p/Xp for the latter. 
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between the success of the adaptation in terms of the classification accuracy for the target 
domain, and the overlap between the domain distributions in such visualizations. 

5.2.4 Results On Image Data Sets 

We now discuss the experimental settings and the results. In each case, we train on the 
source data set and test on a different target domain data set, with considerable shifts 
between domains (see Figure 6). The results are summarized in Table 2 and Table 3, 

MNIST MNIST-M. Our first experiment deals with the MNIST data set (LeCun et ah, 
1998) (source). In order to obtain the target domain (MNIST-M) we blend digits from the 
original set over patches randomly extracted from color photos from BSDS500 (Arbelaez 
et ah, 2011). This operation is formally defined for two images 7^,/^ as 
where i^j are the coordinates of a pixel and A; is a channel index. In other words, an output 
sample is produced by taking a patch from a photo and inverting its pixels at positions 
corresponding to the pixels of a digit. For a human the classification task becomes only 
slightly harder compared to the original data set (the digits are still clearly distinguishable) 
whereas for a CNN trained on MNIST this domain is quite distinct, as the background and 
the strokes are no longer constant. Consequently, the source-only model performs poorly. 
Our approach succeeded at aligning feature distributions (Figure 5), which led to successful 
adaptation results (considering that the adaptation is unsupervised). At the same time, 
the improvement over source-only model achieved by subspace alignment (SA) (Fernando 
et ah, 2013) is quite modest, thus highlighting the difficulty of the adaptation task. 

Synthetic numbers SVHN. To address a common scenario of training on synthetic data 
and testing on real data, we use Street-View House Number data set SVHN (Netzer et ah, 
2011) as the target domain and synthetic digits as the source. The latter (Syn Numbers) 
consists of 500,000 images generated by ourselves from Windows^^ fonts by varying the 
text (that includes different one-, two-, and three-digit numbers), positioning, orientation, 
background and stroke colors, and the amount of blur. The degrees of variation were chosen 
manually to simulate SVHN, however the two data sets are still rather distinct, the biggest 
difference being the structured clutter in the background of SVHN images. 

The proposed backpropagation-based technique works well covering almost 80% of the 
gap between training with source data only and training on target domain data with known 
target labels. In contrast, SA (Fernando et ah, 2013) results in a slight classification ac¬ 
curacy drop (probably due to the information loss during the dimensionality reduction), 
indicating that the adaptation task is even more challenging than in the case of the MNIST 
experiment. 

MNIST CA SVHN. In this experiment, we further increase the gap between distributions, 
and test on MNIST and SVHN, which are significantly different in appearance. Training 
on SVHN even without adaptation is challenging — classification error stays high during 
the first 150 epochs. In order to avoid ending up in a poor local minimum we, therefore, do 
not use learning rate annealing here. Obviously, the two directions (MNIST ^ SVHN and 
SVHN ^ MNIST) are not equally difficult. As SVHN is more diverse, a model trained 
on SVHN is expected to be more generic and to perform reasonably on the MNIST data 
set. This, indeed, turns out to be the case and is supported by the appearance of the 
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Figure 6: Examples of domain pairs used in the experiments. See Section 5.2.4 for details. 


Source 

Method 

Target 

MNIST 

Syn Numbers 

SVHN 

Syn Signs 

MNIST-M 

SVHN 

MNIST 

GTSRB 

Source only 

.5225 

.8674 

.5490 

.7900 

SA (Fernando et al., 2013) 

.5690 (4.1%) 

.8644 (-5.5%) 

.5932 (9.9%) 

.8165 (12.7%) 

DANN 

.7666 (52.9%) 

.9109 (79.7%) 

.7385 (42.6%) 

.8865 (46.4%) 

Train on target 

.9596 

.9220 

.9942 

.9980 


Table 2: Classification accuracies for digit image classifications for different source and 
target domains. MNIST-M corresponds to difference-blended digits over non- 
uniform background. The first row corresponds to the lower performance bound 
(i.e., if no adaptation is performed). The last row corresponds to training on 
the target domain data with known class labels (upper bound on the DA perfor¬ 
mance). For each of the two DA methods (ours and Fernando et ah, 2013) we 
show how much of the gap between the lower and the upper bounds was covered 
(in brackets). For all five cases, our approach outperforms Fernando et al. (2013) 
considerably, and covers a big portion of the gap. 


, ^ Source 

Method 

Target 

Amazon 

DSLR 

Webcam 

Webcam 

Webcam 

DSLR 

GFK(PLS, PGA) (Gong et al., 2012) 

.197 

.497 

.6631 

SA* (Fernando et al., 2013) 

.450 

.648 

.699 

DLID (Chopra et al., 2013) 

.519 

.782 

.899 

DDG (Tzeng et al., 2014) 

.618 

.950 

.985 

DAN (Long and Wang, 2015) 

.685 

.960 

.990 

SOURGE ONLY 

.642 

.961 

.978 

DANN 

.730 

.964 

.992 


Table 3: Accuracy evaluation of different DA approaches on the standard Office (Saenko 
et ah, 2010) data set. All methods (except SA) are evaluated in the “fully- 
transductive” protocol (some results are reproduced from Long and Wang, 2015). 
Our method (last row) outperforms competitors setting the new state-of-the-art. 
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Batches seen .]^q5 


Figure 7: Results for the traffic signs classification in the semi-supervised setting. Syn 
and Real denote available labeled data (100,000 synthetic and 430 real images 
respectively); Adapted means that 31,000 unlabeled target domain images were 
used for adaptation. The best performance is achieved by employing both the 
labeled samples and the large unlabeled corpus in the target domain. 


feature distributions. We observe a quite strong separation between the domains when we 
feed them into the CNN trained solely on MNIST, whereas for the SVHN-trained network 
the features are much more intermixed. This difference probably explains why our method 
succeeded in improving the performance by adaptation in the SVHN ^ MNIST scenario 
(see Table 2) but not in the opposite direction (SA is not able to perform adaptation in 
this case either). Unsupervised adaptation from MNIST to SVHN gives a failure example 
for our approach: it doesn’t manage to improve upon the performance of the non-adapted 
model which achieves 0.25 accuracy (we are unaware of any unsupervised DA methods 
capable of performing such adaptation). 

Synthetie Signs GTSRB. Overall, this setting is similar to the Syn Numbers ^ SVHN 
experiment, except the distribution of the features is more complex due to the significantly 
larger number of classes (43 instead of 10). For the source domain we obtained 100,000 
synthetic images (which we call Syn Signs) simulating various imaging conditions. In the 
target domain, we use 31,367 random training samples for unsupervised adaptation and the 
rest for evaluation. Once again, our method achieves a sensible increase in performance 
proving its suitability for the synthetic-to-real data adaptation. 

As an additional experiment, we also evaluate the proposed algorithm for semi-supervised 
domain adaptation, i.e., when one is additionally provided with a small amount of labeled 
target data. Here, we reveal 430 labeled examples (10 samples per class) and add them 
to the training set for the label predictor. Figure 7 shows the change of the validation 
error throughout the training. While the graph clearly suggests that our method can be 
beneficial in the semi-supervised setting, thorough verification of semi-supervised setting is 
left for future work. 

Offiee data set. We finally evaluate our method on Office data set, which is a collection of 
three distinct domains: Amazon, DSLR, and Webcam. Unlike previously discussed data 
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sets, Office is rather small-scale with only 2817 labeled images spread across 31 different 
categories in the largest domain. The amount of available data is crucial for a successful 
training of a deep model, hence we opted for the fine-tuning of the CNN pre-trained on 
the ImageNet (AlexNet from the Caffe package, see Jia et ah, 2014) as it is done in some 
recent DA works (Donahue et ah, 2014; Tzeng et ah, 2014; Hoffman et ah, 2013; Long and 
Wang, 2015). We make our approach more comparable with Tzeng et al. (2014) by using 
exactly the same network architecture replacing domain mean-based regularization with the 
domain classifier. 

Following previous works, we assess the performance of our method across three transfer 
tasks most commonly used for evaluation. Our training protocol is adopted from Gong 
et al. (2013); Chopra et al. (2013); Long and Wang (2015) as during adaptation we use 
all available labeled source examples and unlabeled target examples (the premise of our 
method is the abundance of unlabeled data in the target domain). Also, all source domain 
data are used for training. Under this “fully-transductive” setting, our method is able 
to improve previously-reported state-of-the-art accuracy for unsupervised adaptation very 
considerably (Table 3), especially in the most challenging Amazon ^ Webcam scenario 
(the two domains with the largest domain shift). 

Interestingly, in all three experiments we observe a slight over-fitting (performance on 
the target domain degrades while accuracy on the source continues to improve) as training 
progresses, however, it doesn’t ruin the validation accuracy. Moreover, switching off the 
domain classifier branch makes this effect far more apparent, from which we conclude that 
our technique serves as a regularizer. 

5.3 Experiments with Deep Image Descriptors for Re-Identification 

In this section we discuss the application of the described adaptation method to person 
re-identification (re-id) problem. The task of person re-identification is to associate people 
seen from different camera views. More formally, it can be defined as follows: given two 
sets of images from different cameras {probe and gallery) such that each person depicted in 
the probe set has an image in the gallery set, for each image of a person from the probe 
set find an image of the same person in the gallery set. Disjoint camera views, different 
illumination conditions, various poses and low quality of data make this problem difficult 
even for humans (e.^., Liu et ah, 2013, reports human performance at Rankl=71.08%). 

Unlike classification problems that are discussed above, re-identification problem implies 
that each image is mapped to a vector descriptor. The distance between descriptors is then 
used to match images from the probe set and the gallery set. To evaluate results of re-id 
methods the Cumulative Mateh Charaeteristie (CMC) curve is commonly used. It is a plot 
of the identification rate (recall) at rank-A;, that is the probability of the matching gallery 
image to be within the closest k images (in terms of descriptor distance) to the probe image. 

Most existing works train descriptor mappings and evaluate them within the same data 
set containing images from a certain camera network with similar imaging conditions. Sev¬ 
eral papers, however, observed that the performance of the resulting re-identification sys¬ 
tems drops very considerably when descriptors trained on one data set and tested on an¬ 
other. It is therefore natural to handle such cross-domain evaluation as a domain-adaptation 
problem, where each camera network (data set) constitutes a domain. 
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VIPER 


CUHK 


PRID 


Figure 8: Matching and non-matching pairs of probe-gallery images from different person 
re-identification data sets. The three data sets are treated as different domains 
in our experiments. 


Recently, several papers with significantly improved re-identification performance (Zhang 
and Saligrama, 2014; Zhao et ah, 2014; Paisitkriangkrai et ah, 2015) have been presented, 
with Ma et al. (2015) reporting good results in cross-data-set evaluation scenario. At the 
moment, deep learning methods (Yi et ah, 2014) do not achieve state-of-the-art results prob¬ 
ably because of the limited size of the training sets. Domain adaptation thus represents a 
viable direction for improving deep re-identification descriptors. 

5.3.1 Data Sets and Protocols 

Following Ma et al. (2015), we use PRID (Hirzer et ah, 2011), VIPeR (Gray et ah, 2007), 
CUHK (Li and Wang, 2013) as target data sets for our experiments. The PRID data set 
exists in two versions, and as in Ma et al. (2015) we use a single-shot variant. It contains 
images of 385 persons viewed from camera A and images of 749 persons viewed from camera 
B, 200 persons appear in both cameras. The VIPeR data set also contains images taken 
with two cameras, and in total 632 persons are captured, for every person there is one image 
for each of the two camera views. The CUHK data set consists of images from five pairs of 
cameras, two images for each person from each of the two cameras. We refer to the subset 
of this data set that includes the first pair of cameras only as CUHK/pl (as most papers 
use this subset). See Figure 8 for samples of these data sets. 

We perform extensive experiments for various pairs of data sets, where one data set 
serves as a source domain, i.e., it is used to train a descriptor mapping in a supervised 
way with known correspondences between probe and gallery images. The second data set is 
used as a target domain, so that images from that data set are used without probe-gallery 
correspondence. 

In more detail, CUHK/pl is used for experiments when CUHK serves as a target domain 
and two settings (“whole CUHK” and CUHK/pl) are used for experiments when CUHK 
serves as a source domain. Given PRID as a target data set, we randomly choose 100 persons 
appearing in both camera views as training set. The images of the other 100 persons from 
camera A are used as probe, all images from camera B excluding those used in training (649 
in total) are used as gallery at test time. For VIPeR, we use random 316 persons for training 
and all others for testing. For CUHK, 971 persons are split into 485 for training and 486 
for testing. Unlike Ma et al. (2015), we use all images in the first pair of cameras of CUHK 
instead of choosing one image of a person from each camera view. We also performed two 
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experiments with all images of the whole CUHK data set as source domain and VIPeR and 
PRID data sets as target domains as in the original paper (Yi et ah, 2014). 

Following Yi et al. (2014), we augmented our data with mirror images, and during 
test time we calculate similarity score between two images as the mean of the four scores 
corresponding to different flips of the two compared images. In case of CUHK, where there 
are 4 images (including mirror images) for each of the two camera views for each person, 
all 16 combinations’ scores are averaged. 

5.3.2 CNN architectures and Training Procedure 

In our experiments, we use Siamese architecture described in Yi et al. (2014) {Deep Metrie 
Learning or DML) for learning deep image descriptors on the source data set. This archi¬ 
tecture incorporates two convolution layers (with 7x7 and 5x5 filter banks), followed 
by ReLU and max pooling, and one fully-connected layer, which gives 500-dimensional de¬ 
scriptors as an output. There are three parallel flows within the CNN for processing three 
part of an image: the upper, the middle, and the lower one. The first convolution layer 
shares parameters between three parts, and the outputs of the second convolution layers 
are concatenated. During training, we follow Yi et al. (2014) and calculate pairwise cosine 
similarities between 500-dimensional features within each batch and backpropagate the loss 
for all pairs within batch. 

To perform domain-adversarial training, we construct a DANN architecture. The fea¬ 
ture extractor includes the two convolutional layers (followed by max-pooling and ReLU) 
discussed above. The label predictor in this case is replaced with deseriptor predietor that 
includes one fully-connected layer. The domain classifier includes two fully-connected layers 
with 500 units in the intermediate representation (t^500^1). 

For the verification loss function in the descriptor predictor we used Binomial Deviance 
loss, defined in Yi et al. (2014) with similar parameters: d = 2, /3 = 0.5, c = 2 (the 
asymmetric cost parameter for negative pairs). The domain classifier is trained with logistic 
loss as in subsection 5.2.2, 

We used learning rate fixed to 0.001 and momentum of 0.9. The schedule of adaptation 
similar to the one described in subsection 5.2.2 was used. We also inserted dropout layer 
with rate 0.5 after the concatenation of outputs of the second max-pooling layer. 128-sized 
batches were used for source data and 128-sized batches for target data. 

5.3.3 Results on Re-identification data sets 

Figure 9 shows results in the form of CMC-curves for eight pairs of data sets. Depending on 
the hardness of the annotation problem we trained either for 50,000 iterations (CUHK/pl 
^ VIPeR, VIPeR ^ CUHK/pl, PRID ^ VIPeR) or for 20,000 iterations (the other five 
pairs). 

After the sufficient number of iterations, domain-adversarial training consistently im¬ 
proves the performance of re-identification. For the pairs that involve PRID data set, which 
is more dissimilar to the other two data sets, the improvement is considerable. Overall, 
this demonstrates the applicability of the domain-adversarial learning beyond classification 
problems. 
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Figure 9: Results on VIPeR, PRID and CUHK/pl with and without domain-adversarial 
learning. Across the eight domain pairs domain-adversarial learning improves re¬ 
identification accuracy. For some domain pairs the improvement is considerable. 
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(a) DML (b) DML, adaptation 

Figure 10: The effect of adaptation shown by t-SNE visualizations of source and target 
domains descriptors in a VIPeR ^ CUHK/pl experiment pair. VIPeR is de¬ 
picted with green and CUHK/pl - with red. As in the image classification case, 
domain-adversarial learning ensures a closer match between the source and the 
target distributions. 


Figure 10 further demonstrates the effect of adaptation on the distributions of the 
learned descriptors in the source and in target sets in VIPeR ^ CUHK/pl experiments, 
where domain adversarial learning once again achieves better intermixing of the two do¬ 
mains. 


6. Conclusion 

The paper proposes a new approach to domain adaptation of feed-forward neural networks, 
which allows large-scale training based on large amount of annotated data in the source 
domain and large amount of unannotated data in the target domain. Similarly to many 
previous shallow and deep DA techniques, the adaptation is achieved through aligning the 
distributions of features across the two domains. However, unlike previous approaches, the 
alignment is accomplished through standard backpropagation training. 

The approach is motivated and supported by the domain adaptation theory of Ben-David 
et al. (2006, 2010). The main idea behind DANN is to enjoin the network hidden layer to 
learn a representation which is predictive of the source example labels, but uninformative 
about the domain of the input (source or target). We implement this new approach within 
both shallow and deep feed-forward architectures. The latter allows simple implementation 
within virtually any deep learning package through the introduction of a simple gradient 
reversal layer. We have shown that our approach is flexible and achieves state-of-the-art 
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results on a variety of benchmark in domain adaptation, namely for sentiment analysis and 
image classification tasks. 

A convenient aspect of our approach is that the domain adaptation component can be 
added to almost any neural network architecture that is trainable with backpropagation. 
Towards this end, We have demonstrated experimentally that the approach is not confined 
to classification tasks but can be used in other feed-forward architectures, e.^., for descriptor 
learning for person re-identification. 
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